Using machine learning to determine the nationalities of the fastest 100-mile ultra-marathoners and identify top racing events

The present study intended to determine the nationality of the fastest 100-mile ultra-marathoners and the country/events where the fastest 100-mile races are held. A machine learning model based on the XG Boost algorithm was built to predict the running speed from the athlete’s age (Age group), gender (Gender), country of origin (Athlete country) and where the race occurred (Event country). Model explainability tools were then used to investigate how each independent variable influenced the predicted running speed. A total of 172,110 race records from 65,392 unique runners from 68 different countries participating in races held in 44 different countries were used for analyses. The model rates Event country (0.53) as the most important predictor (based on data entropy reduction), followed by Athlete country (0.21), Age group (0.14), and Gender (0.13). In terms of participation, the United States leads by far, followed by Great Britain, Canada, South Africa, and Japan, in both athlete and event counts. The fastest 100-mile races are held in Romania, Israel, Switzerland, Finland, Russia, the Netherlands, France, Denmark, Czechia, and Taiwan. The fastest athletes come mostly from Eastern European countries (Lithuania, Latvia, Ukraine, Finland, Russia, Hungary, Slovakia) and also Israel. In contrast, the slowest athletes come from Asian countries like China, Thailand, Vietnam, Indonesia, Malaysia, and Brunei. The difference between male and female predictions is relatively small at about 0.25 km/h. The fastest age group is 25–29 years, but the average speeds of groups 20–24 and 30–34 years are close. Participation, however, peaks for the age group 40–44 years. The model predicts the event location (country of event) as the most important predictor for a fast 100-mile race time. The fastest race courses were located in Romania, Israel, Switzerland, Finland, Russia, the Netherlands, France, Denmark, Czechia, and Taiwan.
Athletes and coaches can use these findings for their race preparation to find the most appropriate racecourse for a fast 100-mile race time.


Introduction
Ultra-endurance events lasting more than 6 hours include ultramarathon running races [1]. The popularity of these events has increased significantly in the last 25 years, particularly in ultra-marathon races, where an exponential increase in participation has been observed [2,3]. The ultra-marathon race distance of 100 miles (161 km) is highly popular, especially in the United States [4]. The high popularity of the 100-mile race distance among ultra-marathoners has led to a high level of scientific interest among researchers [5][6][7][8][9]. The main topics of scientific interest have included fluid and electrolyte metabolism, the heart structure and function of the 100-mile ultra-marathoner, successful race performance, pacing, nutrition, anthropometry, age, mental toughness, sleep, muscle damage, skeletal and renal health, overuse injuries, metabolomics, and pain perception [5][6][7]. A large number of studies have been performed about the 'Western States Endurance Run' [8,9], where the first paper appeared in 1987 [6].
The largest body of research has been performed on fluid and electrolyte metabolism [10], focusing on specific aspects such as exercise-associated hyponatremia [11,12] and fluid metabolism [13,14]. Considering the heart of the 100-mile ultra-marathoners, aspects such as cardiac adaptation [15], the heart rate variability [5], alterations in cardiac mechanics [16], and the right ventricle [17] were of scientific interest. Other aspects of 100-mile ultra-marathoners such as the age of the best performance [18], the sex difference in performance [19], age group performances [18], physiological aspects [20], nutrition [21], mental toughness [7], pacing [22], and the use of non-steroidal anti-inflammatory drugs [23,24] have also been studied extensively.
Most of the research on 100-mile ultra-marathoners has been performed in races held in the United States, and most 100-mile runners originate from that country. However, a study conducted to understand the best countries competing at 100 miles, using a macro-to-micro analysis, showed that most of the athletes were from the American and European continents, although the fastest were from Africa [25]. An analysis of each continent showed that women from Sweden, Hungary and Russia presented the best performances in the top three, top 10 and top 100, while the fastest men were from Brazil, Russia and Lithuania [25]. However, we do not know (i) athletes of which nationality are the fastest in 100-mile ultra-marathon running and (ii) where the fastest 100-mile race courses are located worldwide. In this context, we undertook this research to determine the country of origin of the fastest runners and the location of the fastest race courses. These insights would help athletes and coaches to better plan their race strategy to obtain a fast race time. Based on these recent findings, we hypothesized that most of the 100-mile runners would originate from Europe and America, that the fastest 100-mile runners would originate from Africa, and that the fastest race courses would be situated in Africa, Europe and/or America.

Ethical approval
This study was approved by the Institutional Review Board of Kanton St. Gallen, Switzerland, with a waiver of the requirement for informed consent of the participants as the study involved the analysis of publicly available data (EKSG 01/06/2010). The study was conducted following recognized ethical standards according to the Declaration of Helsinki adopted in 1964 and revised in 2013.

Data set and data preparation
The race data were obtained from DUV Ultra-Marathon Statistik (statistik.d-u-v.org/geteventlist.php) by the end of 2022. The data were accessed July 11, 2023, for research purposes. The raw 100-mile sample contained 172,394 race records, with the United States accounting for around 70% of the sample, while there were also numerous countries with just one or two race records. Each race record included the athlete's name, age group, gender and country of origin, the race location and year, the race distance, and the athlete's race time, from which the race speed was calculated. ISO3 codes were used for the country information. After discarding any incomplete or incorrect instances and filtering out countries with a very low number of records, a total of 172,110 race records from 65,392 unique runners from 68 different countries participating in races held in 44 different countries were used for analyses. To minimize the potential effect of outliers, a minimum of 10 race records was set per country to qualify for the analysis.
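The two preparation steps described above (deriving the race speed from the race time and dropping countries with too few records) can be sketched as follows. This is a minimal illustration and not the authors' code: the "h:mm:ss" time format, the dictionary keys, and the function names are assumptions made for the example, and the exact mile-to-kilometer conversion is used instead of the rounded 161 km.

```python
from collections import Counter

MILE_KM = 1.609344            # exact length of one mile in kilometers
DISTANCE_KM = 100 * MILE_KM   # 100 miles, which the text rounds to 161 km


def time_to_hours(hms):
    """Convert an 'h:mm:ss' string (assumed format) to decimal hours."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours + minutes / 60 + seconds / 3600


def race_speed_kmh(hms, distance_km=DISTANCE_KM):
    """Average running speed in km/h for a given finish time."""
    return distance_km / time_to_hours(hms)


def filter_by_min_records(records, key, min_records):
    """Keep only records whose value under `key` occurs at least `min_records` times."""
    counts = Counter(record[key] for record in records)
    return [record for record in records if counts[record[key]] >= min_records]


# Hypothetical usage with made-up records (field names are illustrative only).
records = [
    {"event_country": "USA", "time": "24:10:00"},
    {"event_country": "USA", "time": "27:45:30"},
    {"event_country": "ROU", "time": "12:50:26"},
]
kept = filter_by_min_records(records, "event_country", min_records=2)
speeds = [race_speed_kmh(record["time"]) for record in kept]
```

The same filter can be applied per athlete country or per event country by changing `key` and `min_records`.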

Statistical analysis
First, two independent ranking tables were created, aggregating the race records by country of origin and event, and then sorting each list of countries by number of race records. To reduce noise and ensure that the results were statistically representative, race records from athlete countries with fewer than 15 records or fewer than five unique runners were removed, and race records from event countries with fewer than 10 records were removed. Descriptive statistical results for each country are summarized in the ranking tables, where the table index also serves as a key to the partial dependence plots (PDP). We then built and evaluated a non-linear machine learning (ML) predictive regression model and examined the model logic through explainability tools. The algorithm used for building the model is the popular XG Boost. XG Boost (xgboost.readthedocs.io/en/stable/) belongs to the family of gradient-boosting tree-ensemble algorithms and is widely used to solve classification and regression problems in data science.
XG Boost regression model. The model was designed to use the following variables as predictors or inputs to the model: "Athlete_gender_ID", "Age_group_ID", "Athlete_country_ID", and "Event_country_ID". The predicted variable, or algorithm output, was the race (running) speed (km/h). Before the data could be fit in the model, the predictors had to be numerically encoded. The Athlete gender variable was encoded as female = 0 and male = 1. The Age group variable was already numerically encoded in 5-year age groups (except group 18, which represents runners of less than 20 years, and group 75, which represents 75 years and older). The Athlete country and Event country variables were encoded based on their position in the respective ranking tables. Fig 1 illustrates the setup, with the variables used as predictors or inputs and the race (running) speed prediction as the model output.
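The encoding described above can be sketched in a few lines. The sketch rests on several assumptions that are not stated in the text: the ranking index starts at 0, ties in record counts are broken alphabetically, the raw field names ("gender", "age", "athlete_country", "event_country") are hypothetical, and the age-group label is taken to be the lower bound of each 5-year band. In the source data the age group was already encoded, so `age_group_id` is shown only to make the grouping rule explicit.

```python
from collections import Counter


def rank_countries(records, key):
    """Map each country to its position in a ranking sorted by number of records."""
    counts = Counter(record[key] for record in records)
    # Descending by count; ties broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda country: (-counts[country], country))
    return {country: index for index, country in enumerate(ordered)}


def age_group_id(age):
    """5-year age group label: 18 for under 20, 75 for 75 and older."""
    if age < 20:
        return 18
    if age >= 75:
        return 75
    return (age // 5) * 5


def encode_record(record, athlete_rank, event_rank):
    """Build the four numeric predictors named in the text from one raw record."""
    return {
        "Athlete_gender_ID": {"F": 0, "M": 1}[record["gender"]],
        "Age_group_ID": age_group_id(record["age"]),
        "Athlete_country_ID": athlete_rank[record["athlete_country"]],
        "Event_country_ID": event_rank[record["event_country"]],
    }


# Hypothetical usage with made-up records.
raw = [
    {"gender": "M", "age": 41, "athlete_country": "USA", "event_country": "USA"},
    {"gender": "F", "age": 27, "athlete_country": "USA", "event_country": "USA"},
    {"gender": "M", "age": 38, "athlete_country": "LTU", "event_country": "GBR"},
]
athlete_rank = rank_countries(raw, "athlete_country")
event_rank = rank_countries(raw, "event_country")
encoded = [encode_record(record, athlete_rank, event_rank) for record in raw]
```

Because the country IDs are ranking positions, they carry no physical meaning by themselves; a tree-based model can still split on them, but the IDs must be read through the ranking tables.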
Model training and evaluation strategy. A hold-out evaluation strategy was used to train and evaluate the model, executing a simulation with different test splits and combinations of several estimators and learning rates. Two evaluation metrics, MAE (Mean Absolute Error) and R², were calculated. In addition, the model's relative feature importances, partial dependence plots (PDP) and prediction distribution plots were calculated and are displayed in the results section. In addition to the model interpretability analysis, a set of descriptive target plot charts show the predictor values, group sizes, and the group's average speed, helping to set expectations for the PDP and prediction charts.
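For readers unfamiliar with the two metrics, the following sketch shows a hold-out split together with MAE and R² written out in plain Python. It is an illustration under assumptions (a 20% test fraction, a fixed random seed, and made-up speed values); the study itself used library implementations and varied the test splits.

```python
import random


def hold_out_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up observed and predicted speeds in km/h, for illustration only.
observed = [6.0, 7.0, 8.0, 9.0]
predicted = [6.5, 7.0, 7.5, 9.0]
mae = mean_absolute_error(observed, predicted)
r2 = r2_score(observed, predicted)
train, test = hold_out_split(list(range(10)), test_fraction=0.2)
```

An R² close to 0 means the predictions explain little more of the speed variance than the overall mean would, while MAE reports the typical error in the same unit as the target (km/h).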
After several iterations and tests, the optimal model parameters and accuracy scores were:
• 500 estimators (learners or trees)
• Learning rate of 0.5
• R² score of 0.23 (in-sample test)
• MAE of 0.87 km/h
Model interpretation. The 'optimal' model accuracy score of R² = 0.23 indicates an existing but moderately weak effect of the predicting variables on the model output. To assess how each predictor contributed to the model output, we computed the relative importance of the model's features, the PDP plots, and the model prediction distributions. The PDP plots show the relative amount of change in the model output for the different values of each predicting variable with respect to a reference value (value 0). The prediction distribution plots use boxplots to show the distribution of the model predictions of average race speed. Descriptive statistical values are given in terms of frequencies (counts), mean, standard deviation (std), minimum values (min), and maximum values (max), and also with median values (in the box plots). All computation and analysis were done using a Jupyter Notebook (Google Colab) and Python and associated libraries (pandas, numpy, xgboost, pdpbox, sklearn, matplotlib, seaborn).
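The partial dependence values described above can be computed by hand, which may help in reading the PDP charts. The sketch below assumes a generic `predict` callable and uses a hypothetical toy model with invented coefficients; it is not the fitted XG Boost model, and the study relied on the pdpbox library rather than this code.

```python
def partial_dependence(predict, rows, feature, values, reference=0):
    """Average prediction with `feature` forced to each value, relative to `reference`."""

    def average_prediction(value):
        # Overwrite the feature in every row, predict, and average.
        return sum(predict({**row, feature: value}) for row in rows) / len(rows)

    baseline = average_prediction(reference)
    return {value: average_prediction(value) - baseline for value in values}


def toy_predict(row):
    """Hypothetical stand-in model: invented coefficients, for illustration only."""
    return 6.0 + 0.3 * row["Athlete_gender_ID"] - 0.02 * abs(row["Age_group_ID"] - 25)


rows = [
    {"Athlete_gender_ID": 0, "Age_group_ID": 25},
    {"Athlete_gender_ID": 1, "Age_group_ID": 40},
]
gender_pd = partial_dependence(toy_predict, rows, "Athlete_gender_ID", [0, 1])
```

Each value in `gender_pd` is the change in average predicted speed when the predictor is set to that value for every record, measured against the reference value 0.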

Results
The qualifying sample used for analysis consists of 172,110 race records from 65,392 unique runners from 68 different countries participating in races held in 44 different countries. Table 1 presents the country rankings by number of race records and unique runners. The mean race speed is color-coded, with darker colors corresponding to higher values (faster running speed). The first column in the ranking tables is the index to interpret the PDP charts. The United States accounted for the highest participation in both athlete country and event country rankings, followed by Great Britain, Canada, South Africa, Japan, Germany, and Australia.

Event country ranking
The country of event ranking table, with 42 countries, is shown in Table 2. Most runners competed in races held in the United States, Great Britain, South Africa, Japan, and Germany.

Model features relative importances
The 'optimal' model can only explain 23% of the race speed variability through the four predictors at best, indicating that additional predicting variables should be added to the model in order to improve its accuracy. The model (Fig 2) rates Event country (0.49) as the most important predictor (based on data entropy reduction), followed by Athlete country (0.24), Age group (0.15), and Gender (0.13).

Partial dependence plots (PDP)
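The PDP plots show the following: Model outputs are around 0.26 km/h higher for males than for females (Fig 3). The highest model outputs are given to runners in age groups 25-29 years and 30-34 years (Fig 4). Athlete country ID 54 (Lithuania) shows a distinct peak, matching the highest mean speed in the ranking table (Fig 5). Event country IDs 16, 19, 23 and 34 obtain the highest peaks in the corresponding PDP chart, although only 23 (Switzerland) and 34 (Romania) are among the fastest in the ranking table (Fig 6).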

Prediction distributions and target plots
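The target plots represent a descriptive visualization of the 100-mile race dataset by predictor and show the groups' sizes and average speeds. The prediction plots show the distribution of the XG Boost model output (the predicted race speed) by predictor value through a set of boxplots. The difference between male and female predictions is relatively small at about 0.23 km/h (Fig 7). The fastest age group is 25-29 years, but the average speeds of groups 20-24 years and 30-34 years stay close (Fig 8). Participation, however, peaks in the age group 40-44 years. The model replicates predictions that loosely follow the average speed curve.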

Discussion
The present study aimed to determine the country of origin of the fastest 100-mile runners and the countries hosting the fastest 100-mile race courses using an XG Boost regression model. We found that the event location (i.e., the country where the race is held) was the most important predictor for a fast 100-mile race time, with the fastest race courses located in Romania, Israel, Switzerland, Finland, Russia, the Netherlands, France, Denmark, Czechia, and Taiwan. Regarding the first aim, the fastest athletes come mostly from Eastern European countries (i.e., Lithuania, Latvia, Ukraine, Finland, Russia, Hungary, and Slovakia).

The fastest race courses
The first important finding was that the country of the event was the most important feature concerning the XG Boost model's predictive power. The countries with the fastest 100-mile events were Romania, Israel, Switzerland, Finland, Russia, the Netherlands, France, Denmark, Czechia, and Taiwan. Therefore, we could confirm our hypothesis only for Europe, not for Africa and/or America. Common to these races or race courses was the fact they were road-based flat courses on small loops. In some instances, the races recorded the 100-mile split times in a longer race, such as a 24-hour race or longer. In other instances, the races were held as indoor races. In a few instances, the races were held as Championships, such as European or World Championships. In more detail, in Romania, the 'IAU 24 h European Championship' was held in Timisoara in 2018, where the 100-mile split times were taken. The race is a road-based ultra-marathon held on a 1,236 m long asphalt loop (http://s24h.ro/). Importantly, Aleksandr Sorokin from Lithuania passed the 100 miles in 12:50:26 h:min:s. In Israel, the 'Spartanion 100 Miles Race' has been held since 2020 in Ganei Yehoshua Park, Tel Aviv, on a 1,459 m long circular, fast and clean course (https://spartanion.com/). In Switzerland, the '24 heures de Lausanne' recorded a 100-mile split time of 12:28:16 h:min:s in 1981. Furthermore, the '24-Stundenlauf Aare-Insel Brugg' (www.24stundenlauf.ch) and the 'Self-Transcendence 24h Lauf Basel' (https://ch.srichinmoyraces.org/self-transcendence-1224-stunden-lauf-basel) recorded 100-mile split times. In addition, in 1993, the 2nd 'IAU 24h EC Basel' was held with 100-mile split times. In Finland, the 'Endurance 24 h Ultrarun Espoo' has been held since 2010, with 100-mile split times taken. The course is a 390.04 m mondo-surfaced indoor track at Esport Ratiopharm Arena in Tapiola Sports Center, Espoo (https://endurance.fi/e24). Different 100-mile races have been held in Russia, such as the '24h 'Sutki Begom' Moskau'. In Czechia, a 100-mile split time was recorded in the 'Brno Spring 48 Hour Indoor' as an indoor run. Later, split times were recorded in the 'Self-Transcendence Race 24h Kladno' and the 'Běh na 24 hodin Pilsen' as an indoor run.
The finding that the country of the race is the most important predictor of performance might be attributed to topographic characteristics, environmental conditions, and runners' preference for specific races to achieve optimal performance. Concerning topographic characteristics, most countries with the fastest races share the common feature of flat terrain. In addition, most of these countries have a continental climate that favors fast race times. It is also well known that the training process follows the principle of periodization [26,27], according to which the training is divided into specific phases where the characteristics of exercise (e.g., intensity, volume, recovery and mode) are manipulated to peak performance at a certain time. In this context, runners are assumed to participate in a race that fits within their training plan. In addition, a specific race may be selected for its reputation (a race can be considered more important than another) when it is already known that other high-level runners intend to participate. This leads to a sequence of reciprocal cause and effect: the fast runners choose fast races to compete in, and in turn, the participation of fast runners ensures that a fast race remains fast.

The fastest runners
In contrast to a recent study reporting that the fastest 100-mile ultra-marathoners were women from Sweden, Hungary and Russia and men from Brazil, Russia and Lithuania [25], we found that runners from Lithuania, Latvia, Ukraine, Finland, Russia, Hungary, and Slovakia obtained the fastest running speeds. In the first instance, we found that 35 race records by runners from Lithuania were among the fastest. Although it might be possible that one or a few runners from the same country could bias the result, the best Lithuanian ultra-marathoner, Aleksandr 'Sania' Sorokin, has finished only four 100-mile races. These include the world record for 100 miles on the track in 2021 at the 'Centurion Running Track 100 Mile' in the United Kingdom and for 100 miles on the road at the 'Spartanion Race' in 2022 in Israel (www.irunfar.com/aleksandr-sorokin-150-kilometer-100-mile-and-12-hour-world-record-holder-interview). Therefore, 31 race records must be from other fast Lithuanian ultra-marathoners. It should be highlighted that the fastest runners in the present study originated from countries that shared geographical, cultural, and socioeconomic characteristics. Furthermore, a recent review reported a dominance of Russian athletes in ultra-marathon running and suggested as potential explanations a possible misuse of performance-enhancing substances, as well as historical, climate-geographical, and psychophysiological (e.g., a combination of genetic and social) factors [28]. Although most 100-mile runners were from the United States, US runners are not among the fastest. In the US, plenty of 100-mile races are held, and most are trail runs (https://runningintheusa.com/classic/list/map/100m). One of the most traditional 100-mile races is the 'Western States 100 Mile Endurance Run' held since 1976 (www.ws100.com/). Another 100-mile race with a long tradition is the 'Old Dominion 100 Mile Endurance Run', which started in 1979 (www.olddominionrun.org/). The greater participation of US runners may lower the country's average performance, which does not necessarily mean that US runners are slower than those of other nationalities. The relatively large number of US-American finishers in this race distance indicates that these runners could be more 'recreational' than those from other countries (who, in turn, could be considered more 'selective'), which might partially explain why they were not among the fastest nationalities.

The age of peak performance
We also found that athletes in the age group 25 (25-29 years) were the fastest in the 100-mile race distance. This age is considerably lower than that found in a study of 35,956 finishes (6,862 women and 29,094 men) in 100-mile ultramarathons between 1998 and 2011, in which the annual top ten fastest runners had an average age of ~39 and ~37 years for women and men, respectively [29]. The difference from the present results might arise because the present study considered all athletes, whereas the existing study was restricted to the annual ten fastest. Furthermore, the relatively young age of the fastest finishers in our study might be explained in terms of 'selectiveness' variation by age group. The number of finishers in the 25-29 age group is three to four times smaller than that in the age groups 35-39, 40-44 and 45-49, suggesting that the athletes in the former group might be considered more 'selected' compared to the more 'recreational' athletes of the latter groups. Viewed another way, these are interesting findings, indicating that young runners, when well-trained, can perform well in ultramarathon events.

Limitations
Although this study uses a very large data set and highly sophisticated analyses, we must acknowledge some limitations. We found that the fastest running speeds were obtained by runners from Lithuania, Latvia, Ukraine, Finland, Russia, Hungary, and Slovakia, some of which have low numbers of runners. Since we did not account for repeated measures, one or two outstanding athletes from these countries could be responsible for the country's performance. However, as described for Aleksandr 'Sania' Sorokin, a single athlete cannot account for all the best race results of one country. Aspects such as training, previous experience [30], motivation [31], drafting [32], pre-race nutrition [33], and environmental conditions [34] could not be considered. We must also be aware that these race courses might not all have been exactly measured, so some very fast race courses might not have the full length of 100 miles (161 km). Another limitation is associated with the available information. With only four predictors, the model could only be very general. More realistic models could be built by collecting additional runner-specific data and combining it with the available data.

Conclusion
In summary, the event location (i.e., the country where the race is held) is the most important predictor for a fast 100-mile race time, according to our XG Boost regression model. The fastest race courses were located in Romania, Israel, Switzerland, Finland, Russia, the Netherlands, France, Denmark, Czechia, and Taiwan. Common to these races or race courses is the fact they are held on a road-based flat course on small loops. In some instances, the races took the 100-mile split times in a longer race, such as a 24-hour race or longer. In other instances, the races were held as indoor races. In a few instances, the races were held as European or World Championships. Athletes and coaches can use these findings for their race preparation to find the most appropriate race course for a fast 100-mile race time. For example, running a 24-hour race (often flat and circular) might be a better way to try to break a 100-mile personal best, thus combining two "races" in one, than running a challenging 100-mile race.

Fig 1. XG Boost model. https://doi.org/10.1371/journal.pone.0303960.g001

Table 1. Athlete country ranking table.